Statisticians often use the term estimate for a value calculated from the data at hand, to draw a distinction between what we see from the data and the theoretical true or exact state of affairs. Data scientists and business analysts are more likely to refer to such a value as a metric. The difference reflects the approach of statistics versus that of data science: accounting for uncertainty lies at the heart of the discipline of statistics, whereas concrete business or organizational objectives are the focus of data science. Hence, statisticians estimate, and data scientists measure.
Peter Bruce, Andrew Bruce, and Peter Gedeck, Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python, O’Reilly Media (2020)
1. Discrete Distribution Sampler
The Discrete_Sampler class randomly picks samples from a list of observed values. Monte Carlo methods need to sample from a pre-existing distribution of values, and Discrete_Sampler does so using one of the following approaches:
UniformSample: Draw a value uniformly at random between the minimum and maximum of the distribution: sample = random.uniform(min(distribution), max(distribution))
BinSample: Pick a random element of the distribution (i.e. of the ordered list of observed values), so that each value is drawn in proportion to how often it occurs. Known in the literature as bootstrap sampling: sample = random.choice(distribution)
Jittered BinSample: Pick a random value as in BinSample, then add a proportionally small amount of random noise (in the code below, up to ±5% of the value, clipped to the observed range): sample = bin_sample + jitter_noise
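The three approaches above can be sketched as standalone functions. This is a minimal illustration, not the class itself: the function names, the `jitter_fraction` parameter, and the example `values` list are assumptions made for the sketch.

```python
import random


def uniform_sample(distribution):
    """Draw a float uniformly between the smallest and largest observed value."""
    return random.uniform(min(distribution), max(distribution))


def bin_sample(distribution):
    """Draw one observed value at random (sampling with replacement)."""
    return random.choice(distribution)


def jittered_bin_sample(distribution, jitter_fraction=0.1):
    """Draw an observed value, then perturb it by up to +/- jitter_fraction/2 of itself."""
    value = bin_sample(distribution)
    noisy = value + random.uniform(-0.5, 0.5) * jitter_fraction * value
    # Keep the jittered value inside the observed range.
    return min(max(noisy, min(distribution)), max(distribution))


# Hypothetical sample distribution, for illustration only.
values = [1, 2, 2, 3, 5, 8, 40]
random.seed(0)
print(uniform_sample(values), bin_sample(values), jittered_bin_sample(values))
```

Note the difference in what each approach preserves: the uniform sample keeps only the range of the data, while the bin sample keeps the observed frequencies, and the jittered variant keeps them approximately while allowing values that were never observed.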
import numpy as np
import pandas as pd
import math as math
import statistics as stats
import random
from collections import Counter

import khipu_kamayuq as kamayuq  # A Khipu Maker is known (in Quechua) as a Khipu Kamayuq
import khipu_qollqa as kq
import khipu_utils as ku
from pandas import Series, DataFrame

# Plotly
import plotly
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
plotly.offline.init_notebook_mode(connected=False)
class Discrete_Sampler():
    def __init__(self, values):
        values.sort()
        self.values = values
        self.log_values = [math.log(x) for x in self.values if x > 0]
        self.min_value = min(values)
        self.max_value = max(values)
        self.mean_value = stats.mean(values)
        self.stddev = stats.stdev(values)
        counter = Counter(values)
        self.max_occurences = counter.most_common(1)[0][1]
        self.frequency_counter = sorted(counter.most_common(), key=lambda x: x[0])
        self.np_frequency_values = np.array([y_val for (y_val, num_occurences) in self.frequency_counter])

    def __repr__(self):
        the_rep = f"Discrete_Sampler({len(self.values)} samples=)"
        the_rep += f"\n\t{self.min_value=}{self.max_value=}{self.mean_value=}{self.stddev=}"
        the_rep += f"\n\t{self.frequency_counter[0:5]=}"
        the_rep += f"\n\t{self.max_occurences=}"
        the_rep += "\n"
        return the_rep

    def plot_violin_frequency(self, title=None, use_log_value=True):
        df = pd.DataFrame(zip(self.values, self.log_values), columns=['value', 'log_value'])
        legend_text = title if title else "<b>Sample Values</b>"
        legend_text += " <span style=\"font-size:.8em;\">Hover over points to see values</span>"
        fig = (px.violin(df, y="log_value" if use_log_value else "value", points='all',
                         labels={"log_value": "Log<sub>e</sub>(Value)", "value": "Value"},
                         hover_data=['value', 'log_value'],
                         title=legend_text, width=944, height=944).show())

    def uniform_sample(self, num_samples=1):
        return [random.uniform(self.min_value, self.max_value) for _ in range(num_samples)]

    def bin_sample(self, num_samples=1):
        return [random.choice(self.values) for _ in range(num_samples)]

    def jittered_bin_sample(self, num_samples=1):
        def jitter_sample(x):
            sample = x + (random.uniform(-0.5, 0.5))*.1*x
            return ku.clip(sample, self.min_value, self.max_value)

        samples = self.bin_sample(num_samples=num_samples)
        return [jitter_sample(x) for x in samples]

    def continuous_sample(self, num_samples=1):
        def one_sample(x, y):
            # Test to see if the point x,y (num_occurences, value) is in the cumulative density distribution
            # y_index = sum([(y_val < y) for (y_val, num_occurences) in self.frequency_counter])
            y_index = len(self.np_frequency_values[self.np_frequency_values < y])  # Faster than the above
            if y_index > len(self.frequency_counter)-1:
                return None
            next_y_index = y_index+1 if y_index < len(self.frequency_counter)-1 else y_index
            (y1, y2) = (self.np_frequency_values[y_index], self.np_frequency_values[next_y_index])
            lerp_ratio = 0 if y2-y1 == 0 else float(y-y1)/float(y2-y1)
            (x1, x2) = (self.frequency_counter[y_index][1], self.frequency_counter[next_y_index][1])
            interpolated_num_occurences = round(ku.linear_interpolate(x1, x2, lerp_ratio))

            # In the distribution???
            in_CDF = ((x <= interpolated_num_occurences) and (y <= self.max_value) and (y_index < len(self.frequency_counter)-2))
            # If the point is in the distribution, then return the interpolated value
            return y if in_CDF else None

        # Now start generating random sample points, and add them if they are in the distribution
        samples = []
        while len(samples) < num_samples:
            y = random.randint(self.min_value, self.max_value)
            # Trim x to be left triangular part of "rect of [max occurences X max value]"
            # This reduces number of false sample points by half, speeding up algorithm considerably
            lerp_ratio = (y-self.min_value)/(self.max_value-self.min_value)
            x = random.randint(1, round(ku.linear_interpolate(1, self.max_occurences, lerp_ratio)))
            sample = one_sample(x, y)
            if sample:
                samples.append(round(sample))
        return samples
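The class calls ku.clip and ku.linear_interpolate from khipu_utils, which is not shown here. The sketch below gives stand-ins for both, assuming conventional clamp and lerp semantics with the argument order inferred from the call sites; the real helpers may differ. It also shows a plain rejection sampler over a frequency table, to illustrate the accept-or-reject idea behind continuous_sample. The `rejection_sample` function is hypothetical, assumes integer values, and omits the triangular trimming used in the class.

```python
import bisect
import random
from collections import Counter


def clip(value, low, high):
    """Assumed behaviour of ku.clip: clamp value to the range [low, high]."""
    return max(low, min(value, high))


def linear_interpolate(start, end, ratio):
    """Assumed behaviour of ku.linear_interpolate: ratio=0 gives start, ratio=1 gives end."""
    return start + (end - start) * ratio


def rejection_sample(values, num_samples=1):
    """Sample integers under the piecewise-linear envelope of the frequency table."""
    counts = sorted(Counter(values).items())   # [(value, num_occurrences), ...] by value
    ys = [y for y, _ in counts]
    max_count = max(c for _, c in counts)
    samples = []
    while len(samples) < num_samples:
        # Propose a point uniformly in the rectangle [min, max] x [0, max_count].
        y = random.randint(ys[0], ys[-1])
        x = random.uniform(0, max_count)
        hi = bisect.bisect_left(ys, y)         # first index with ys[hi] >= y
        if ys[hi] == y:
            height = counts[hi][1]
        else:
            lo = hi - 1
            ratio = (y - ys[lo]) / (ys[hi] - ys[lo])
            height = linear_interpolate(counts[lo][1], counts[hi][1], ratio)
        # Accept the proposal only if it falls under the interpolated frequency curve.
        if x <= height:
            samples.append(y)
    return samples


random.seed(1)
print(rejection_sample([1, 1, 2, 5, 5, 5, 9], num_samples=5))
```

Unlike bin_sample, this kind of sampler can return values that lie between observed values, because the frequency curve is interpolated across the gaps.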
2. Building a sample distribution
The database of KFG cord values will be used as the sample distribution. Zero values are excluded. This is reasonable because a zero-valued cord makes no difference to a sum, so such cords are ignored when summing anyway. Excluding them also makes it possible to view the distribution on a log scale, since the logarithm of 0 is undefined.
# Use NON_ZERO cords in the khipu database as sample values
(khipu_dict, all_khipus) = kamayuq.fetch_khipus()
sample_values = []
for aKhipu in all_khipus:
    sample_values += [aCord.knotted_value() for aCord in aKhipu.pendant_cords() if aCord.knotted_value() > 0]

the_discrete_sampler = Discrete_Sampler(sample_values)
print(f"{the_discrete_sampler=}")
the_discrete_sampler.plot_violin_frequency(title="Discrete Samples", use_log_value=False)
the_discrete_sampler.plot_violin_frequency(title="Discrete Samples - Log<sub>e</sub>(Value)", use_log_value=True)
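Because kamayuq.fetch_khipus() needs the khipu database, the zero-filtering step can also be tried on made-up data. The sketch below uses a hypothetical `cord_values` list, not values from the KFG database, and computes the same kind of summary that Discrete_Sampler stores.

```python
import math
import statistics as stats
from collections import Counter

# Hypothetical cord values, for illustration only (not taken from the KFG database).
cord_values = [0, 3, 12, 0, 3, 45, 7, 3, 0, 120]

# Drop zero-valued cords, as in the cell above, then sort.
sample_values = sorted(v for v in cord_values if v > 0)

# With zeros removed, every value has a logarithm, so the two lists stay aligned.
log_values = [math.log(v) for v in sample_values]

summary = {
    "num_samples": len(sample_values),
    "min_value": min(sample_values),
    "max_value": max(sample_values),
    "mean_value": stats.mean(sample_values),
    "stddev": stats.stdev(sample_values),
    "most_common": Counter(sample_values).most_common(1)[0],
}
print(summary)
```

Filtering before taking logs matters for plotting: plot_violin_frequency zips the values with their logs, and the two lists only line up if no value was skipped when the logs were computed.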